Use more compact cache representation for int and str #19750

ilevkivskyi · 2025-08-28T10:48:11Z

After looking more at some real data I found that:

More than 99.9% of all ints are between -10 and 117. The exact values are a bit arbitrary TBH; the idea is that we should include small negative values (for TypeVarIds) and still be able to fit them in 1 byte.
More than 99.9% of strings are shorter than 128 bytes (again, the idea is to fit the length into a single byte).
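
To make the idea concrete, here is a minimal sketch of a one-byte fast path for small ints and short string lengths. It is not the encoding this PR implements: the byte layout, the `LONG_FORM` marker, and the `io.BytesIO`-based signatures of `write_int`, `read_int`, `write_str`, and `read_str` are all assumptions chosen for illustration.

```python
import io

# Hypothetical layout, for illustration only (not the actual cache format):
#   first byte 0..127 -> short int, value = byte - 10 (covers -10..117)
#   first byte 0x80   -> long form: 1 size byte + little-endian signed payload
MIN_SHORT, MAX_SHORT = -10, 117
LONG_FORM = 0x80


def write_int(buf: io.BytesIO, value: int) -> None:
    if MIN_SHORT <= value <= MAX_SHORT:
        # Shift the range -10..117 onto 0..127 so it fits in a single byte.
        buf.write(bytes([value - MIN_SHORT]))
        return
    # Enough bytes for the magnitude plus a sign bit.
    size = (value.bit_length() + 8) // 8
    if size > 255:
        raise ValueError("integer too large for this sketch")
    buf.write(bytes([LONG_FORM, size]))
    buf.write(value.to_bytes(size, "little", signed=True))


def read_int(buf: io.BytesIO) -> int:
    first = buf.read(1)[0]
    if first < LONG_FORM:
        return first + MIN_SHORT
    size = buf.read(1)[0]
    return int.from_bytes(buf.read(size), "little", signed=True)


def write_str(buf: io.BytesIO, text: str) -> None:
    data = text.encode("utf-8")
    if len(data) < 128:
        # Short string: the length itself is the single header byte.
        buf.write(bytes([len(data)]))
    else:
        # Long string: marker byte followed by a 4-byte length.
        buf.write(bytes([LONG_FORM]))
        buf.write(len(data).to_bytes(4, "little"))
    buf.write(data)


def read_str(buf: io.BytesIO) -> str:
    first = buf.read(1)[0]
    if first < LONG_FORM:
        size = first
    else:
        size = int.from_bytes(buf.read(4), "little")
    return buf.read(size).decode("utf-8")
```

In this sketch, anything outside -10..117 pays two header bytes, which matches the trade-off described here: the rare cases are allowed to be a little wasteful so that the common ones stay at one byte.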

Note that there are currently very few integers that would fit in two bytes. This is because we only store the line for type alias nodes, and type aliases are usually defined at the top of a module. We can add a special case for two bytes later if needed.

We could probably save another byte for long strings and medium integers, but I don't want to add anything fancy that would only affect fewer than 0.1% of cases.

Finally, you may notice I added a small correctness fix that I spotted by accident while working on this. It is not really related, but it is so minor that it doesn't deserve a separate PR.

JukkaL

Nice -- did you check how much this reduces the size of binary cache files?

JukkaL · 2025-08-28T15:43:27Z

mypyc/test-data/run-classes.test

    write_int(b, 2 ** 85)
+    write_int(b, 255)
    write_int(b, -1)
+    write_int(b, -255)


Also test the edge cases (-11, -10, -9, 116, 117, 118). Test a few more integer lengths (e.g. 15 bits, 23 bits, 30 bits) with arbitrary lower bits.

OK, I think it would also make sense to test something like len(data.getvalue()) == 1 etc.
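
One way to build the inputs suggested above is sketched below. The helper `int_with_bit_length` and the fixed seed are hypothetical, not part of the PR's test file; the resulting values would then be passed through whatever write/read round trip the test already uses.

```python
import random


def int_with_bit_length(bits: int, rng: random.Random) -> int:
    """Return a positive int whose bit_length() is exactly `bits`."""
    top = 1 << (bits - 1)                   # force the highest bit on
    return top | rng.getrandbits(bits - 1)  # arbitrary lower bits


rng = random.Random(0)  # fixed seed so the test data is reproducible

# Values on both sides of the one-byte range -10..117.
boundary_values = [-11, -10, -9, 116, 117, 118]

# Values of the suggested widths, plus their negations.
sized_values = [int_with_bit_length(bits, rng) for bits in (15, 23, 30)]
test_values = boundary_values + sized_values + [-v for v in sized_values]
```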

ilevkivskyi · 2025-08-28T22:35:40Z

@JukkaL

> did you check how much this reduces the size of binary cache files?

It looks like it's ~40% smaller now, for example:

.mypy_cache/3.12/types.data.ff 183K (master)
.mypy_cache/3.12/types.data.ff 103K (this PR)
also btw .mypy_cache/3.12/types.data.json 381K
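
As a quick check of the arithmetic, the snippet below computes the relative reductions from the three file sizes quoted above. The helper name `reduction` is just for illustration.

```python
# Sizes in KB, as reported above for .mypy_cache/3.12/types.data.*
master_kb = 183  # fixed format on master
pr_kb = 103      # fixed format with this PR
json_kb = 381    # JSON cache


def reduction(old: float, new: float) -> float:
    """Percentage by which `new` is smaller than `old`."""
    return (old - new) / old * 100


print(f"{reduction(master_kb, pr_kb):.1f}% smaller than master")
print(f"{reduction(json_kb, pr_kb):.1f}% smaller than the JSON cache")
```

For this one file the reduction relative to master comes out at about 43.7%, consistent with the rough 40% figure.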

github-actions · 2025-08-28T23:12:10Z

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

ilevkivskyi added 4 commits August 28, 2025 00:30

Fix bug

33c0f3f

Optimize str len

5466462

Optimize ints

1ace62c

Tweak comments

2fe38d1

ilevkivskyi requested a review from JukkaL August 28, 2025 10:48

ilevkivskyi mentioned this pull request Aug 28, 2025

Incremental improvements to the fixed format serialization #19738

Closed

JukkaL reviewed Aug 28, 2025

Add more tests

a909227

JukkaL approved these changes Aug 29, 2025

ilevkivskyi merged commit 6a88c21 into python:master Aug 29, 2025
20 checks passed

ilevkivskyi deleted the compact-int branch August 29, 2025 09:53
